Abstract

Partial occlusion is a challenging problem in object tracking. In
online visual tracking, it is the critical factor causing drift. To
address this problem, we propose a novel approach using a co-training
framework of generative and discriminative trackers. Our approach is
able to detect the occluding region and continuously update both the
generative and discriminative models using the information from the
non-occluded part. The generative model encodes all of the appearance
variations using a low-dimensional subspace, which helps provide a
strong reacquisition ability. Meanwhile, the discriminative classifier,
an online support vector machine, focuses on separating the object from
the background using a Histograms of Oriented Gradients (HOG) feature
set. For each search window, an occlusion likelihood map is generated
by the two trackers through a co-decision process. If there is
disagreement between these two trackers, the movement vote of KLT local
features is used as a referee. Precise occlusion segmentation is
performed using MeanShift. Finally, each tracker recovers the occluded
part and updates its own model using the new non-occluded information.
Experimental results on challenging sequences with different types of
objects are presented. We also compare with other state-of-the-art
methods to demonstrate the superiority and robustness of our tracking
framework.


1.  Introduction 

Visual tracking is an important and challenging problem in computer
vision with various practical applications such as surveillance,
robotics, and human-computer interfaces. One of the difficult issues is
appearance change, which may come from varying viewpoints and
illumination conditions. It can also be caused by partial occlusion, a
very challenging problem. In this paper, we aim to track an arbitrary
object with partial occlusion handling using very limited initial
labeled data. The appearance models are learned online using both a
generative and a discriminative tracker.

(a) Some partial occlusion cases (b) Occlusion segmentation
Figure 1. Partial occlusion cases and occlusion segmentation

Discriminative methods focus on finding a decision boundary to separate
the object from the background [9, 3, 2]. Generative trackers instead
aim only at encoding the target appearance. Examples are the
histogram-based methods [1, 17], which are simple but effective at
solving tracking problems. Another way of building a generative
appearance model is to use linear subspaces [21, 13], which have
attracted considerable interest from researchers.

It  is  established  that  discriminative  classifiers  obtain  bet¬ 
ter  performance  than  generative  models  if  there  is  enough 
training  data  [12].  However,  generative  methods  have 
higher  generalization  when  limited  data  is  provided  [18]. 
One  intuitive  way  of  improving  discriminative  and  genera¬ 
tive  methods  is  to  combine  them  together  in  a  hybrid  way. 
Several  methods  [15,  28]  have  followed  this  trend  by  “dis¬ 
criminative  training”  of  a  generative  model.  They  optimize 
a  convex  combination  of  the  generative  and  discriminative 
log  likelihood  functions  to  obtain  the  model.  Co-training, 
originally  presented  by  Blum  and  Mitchell  [4],  is  another 
way  to  combine  different  classifiers  and  has  been  applied  in 
tracking  [23,  29,  16]. 

However, very few methods explicitly address the partial occlusion
problem, which is the critical factor causing drift in visual tracking
using a single camera. Most of the proposed algorithms try to avoid
partial occlusion by using a threshold to stop updating the model
whenever it happens [9, 29]: when the model is updated with an object
appearance that includes occlusion, it learns the noise, i.e. the
occluding region. However, such a threshold is not easy to determine
and depends on each specific sequence, which makes it impractical.
Moreover, trying to avoid partial occlusion prevents the tracker from
following the target after occlusion when its appearance gradually
changes during that time. This issue can be solved if we are able to
detect the occlusion, replace it with the learned information, and
continue updating our model. This helps the tracker adapt to partial
appearance changes without learning the noise. Here, we assume that
there is no abrupt appearance change in the occluded part during
occlusion.

To address this issue, Adam et al. [1] proposed the fragments-based
tracker (Frag-Track) using the integral histogram. The method simply
splits the image patch into rectangular sub-regions and keeps tracking
them by using spatial information and patch similarity measurements.
Because only a static appearance template and color information (or
gray-level information) are used, it is hard to tackle challenging
sequences with object appearance changes, or cluttered background and
distracters. Also, the method only uses a simple rectangle
representation with neither scaling nor rotation, which is neither
practical nor descriptive enough in
visual  tracking.  Pan  and  Hu  [19]  proposed  a  tracking  algo¬ 
rithm  which  analyzes  the  occlusion  by  exploiting  the  spa- 
tiotemporal  context  information.  The  final  decision  is  fur¬ 
ther  double  checked  by  the  reference  targets  and  motion 
constraints.  Even  though  the  performance  of  the  tracker  is 
promising,  it  depends  on  several  thresholds  which  are  not 
easy  to  set.  Moreover,  adaptive  template  matching,  mainly 
used  by  this  approach,  may  not  be  sophisticated  enough  for 
handling  challenging  situations  with  cluttered  background. 
Recently,  Kalal  et  al.  [11]  proposed  the  P-N  Tracker  us¬ 
ing  positive  and  negative  constraints  to  exploit  the  structure 
of  the  data  and  get  feedback  about  the  performance  of  the 
classifier; however, it cannot deal with partial occlusion explicitly.
Without directly detecting the occlusion, MILTrack [3], proposed by
Babenko et al., learns multiple instances in an online manner to avoid
the drifting problem. This method is inspired by Multiple Instance
Boosting proposed by Viola et al. [25], which considers a bag of
samples labeled as positive if there is at least one positive sample,
and as negative otherwise. Like Frag-Track [1] and the P-N Tracker
[11], it only uses a simple rectangular shape without rotation.
Meanwhile, Woodley et al. [27] proposed a tracking method using online
feature selection and a local generative model with occlusion handling.
However, it cannot handle illumination changes because it uses a local
generative model to detect occlusion. Moreover, the method does not
consider object rotation and is not extensively tested in different
environments and situations, such as different types of objects,
indoors and outdoors.


(a)  Original  (b)  Discriminative  (c)  Generative 

Figure  2.  Partial  occlusion  observation  on  our  trackers 


(a)  image  (b)  KLT  (c)  Occlusion  (d)  Movement 
Figure  3.  KLT  movement  when  occlusion  occurs. 

Inspired by the HOG-LBP detector with partial occlusion handling
proposed by Wang et al. [26], which has produced very impressive
results on pedestrian detection, and by the co-trained generative and
discriminative trackers [29] (Co-Tracker), which have shown very robust
results, we propose a co-training framework of generative and
discriminative trackers with partial occlusion handling. We make the
assumption that the appearance changes smoothly while partial
occlusions create abrupt changes. Our method will show no improvement
over others if this assumption is violated.

The  contribution  of  our  paper  is  three-fold:  1)  We  pro¬ 
pose  an  occlusion  detection  method  using  both  generative 
and  discriminative  models;  2)  The  movement  of  local  fea¬ 
ture  voting  process  is  implemented  to  detect  if  the  occlusion 
appears;  3)  An  occlusion  recovery  and  an  online  updating 
step  are  proposed  to  update  both  generative  and  discrimina¬ 
tive  models  based  on  the  non-occluded  part.  It  is  important 
to  emphasize  that  the  algorithm  can  deal  with  different  types 
of  objects  with  very  limited  labeled  data,  i.e.  the  object  be¬ 
ing  selected  in  the  first  frame  only. 

The  rest  of  this  paper  is  organized  as  follows.  The 
overview  of  our  approach  is  presented  in  Section  2.  The 
details  of  the  generative  and  discriminative  trackers  are  de¬ 
scribed  in  Section  3  and  Section  4.  The  movement  of  local 
feature  voting  process  is  then  presented  in  Section  5.  The 
experiments  are  shown  in  Section  6,  followed  by  conclu¬ 
sions  and  future  work. 

2.  Overview  of  our  approach 
2.1.  Motivation 

After studying the classification scores of the linear SVM on the INRIA
dataset [6, 7], Wang et al. [26] noted that the densely extracted
blocks of HOG features in the occluded area uniformly respond to the
linear SVM classifier with negative inner products. Even though there
is a difference between detection and tracking, we observed the same
effect on tracking sequences with different types of objects
(Fig 2(b)). We also investigated the response of a generative model
under partial occlusion and observed that the residual error is much
higher in the occluded area (Fig 2(c)). Moreover, the strong edge
between the object and the occluding area makes the majority of local
features (here we use KLT [22]) in the region about to be occluded move
in the same direction and with the same displacement (Fig 3).

These  observations  allow  us  to  design  a  framework  to 
detect  occlusion. 

2.2.  Overview 

The  overview  of  our  approach  is  illustrated  in  Fig  4.  A 
particle  filter  framework  [10]  is  used  for  sampling  to  es¬ 
timate  the  hidden  state  of  the  object  given  a  sequence  of 
observations. 

Denote $s_t = [x, y, \rho_x, \rho_y, \theta]$ as the state of the
object, where $(x, y)$ is the center of the tracking box,
$(\rho_x, \rho_y)$ is the scale w.r.t. the predefined size of the
object, and $\theta$ is the in-plane rotation angle. To avoid drifting,
the tracker needs to find the object with an accurate center position
at the right scale and rotation. At frame $I_t$, the result given by
the tracker is a cropped image determined by the state of the tracked
object. Let $O_t = (o_1, o_2, \ldots, o_t)$ be the sequence of observed
image regions up to time $t$; our goal is to find the hidden state
$s_t$. Assuming a Markovian state transition, a recursive equation is
applied to formulate the posterior:

$$p(s_t \mid O_t) \propto p(o_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid O_{t-1})\, ds_{t-1} \qquad (1)$$

where $p(s_{t-1} \mid O_{t-1})$ is the posterior distribution from all
the previous observations while $p(o_t \mid s_t)$ and
$p(s_t \mid s_{t-1})$ are the observation and transition models,
respectively. The critical issue is to estimate the likelihood of the
new observation given the posterior distribution. In our approach, the
likelihood comes from two independent models. One is the generative
model, a linear subspace, which is learned online to encode the
variations in appearance. The other is the discriminative model, which
is also trained in an online manner using the HOG feature set [6]. The
co-training framework helps these two models train each other from the
beginning when limited initialization is provided. Each model estimates
the occlusion likelihood of each block in a sample, and they make the
decision together. KLT features [22] are also generated and tracked in
order to determine when occlusions happen by detecting uncertain
regions through a movement voting process. Because of the independence
of these observers, the final likelihood is the product of these
likelihood functions.
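
The recursion in Eq. 1 is usually approximated with a weighted particle set. The following is a minimal Python sketch of one filtering step, assuming a Gaussian random-walk transition model and toy likelihood functions; the noise scales, the function names, and the two stand-in observers are hypothetical choices made for illustration and are not the settings of our tracker. It shows only how the weight of each particle is formed as the product of independent observer likelihoods.

```python
import math
import random


def propagate(particles, sigmas, rng):
    """Sample s_t ~ p(s_t | s_{t-1}) with an assumed Gaussian random walk."""
    return [[v + rng.gauss(0.0, s) for v, s in zip(p, sigmas)] for p in particles]


def reweight(particles, likelihoods):
    """Weight each particle by the product of the observer likelihoods, then normalize."""
    weights = []
    for p in particles:
        w = 1.0
        for lik in likelihoods:
            w *= lik(p)
        weights.append(w)
    total = sum(weights)
    if total <= 0.0:
        # Degenerate case: every likelihood vanished, fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Multinomial resampling proportional to the normalized weights."""
    return [list(p) for p in rng.choices(particles, weights=weights, k=len(particles))]


def best_particle(particles, weights):
    """Return the particle with the largest weight as the state estimate."""
    return particles[max(range(len(particles)), key=lambda i: weights[i])]


# Hypothetical stand-ins for the generative and discriminative observers.
def toy_generative(state):
    return math.exp(-0.5 * ((state[0] - 10.0) ** 2 + (state[1] - 5.0) ** 2) / 4.0)


def toy_discriminative(state):
    return math.exp(-0.5 * ((state[0] - 10.0) ** 2) / 9.0)


rng = random.Random(0)
# State layout [x, y, rho_x, rho_y, theta]; 600 particles as in Section 6.1.
particles = [[8.0, 4.0, 1.0, 1.0, 0.0] for _ in range(600)]
particles = propagate(particles, [2.0, 2.0, 0.01, 0.01, 0.02], rng)
weights = reweight(particles, [toy_generative, toy_discriminative])
estimate = best_particle(particles, weights)
particles = resample(particles, weights, rng)
```

Taking the highest-weight particle as the estimate is one common choice; a weighted mean of the particles is another.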
Figure 4. Overview of our approach


3.  Discriminative  tracker  using  online  SVM 

We adopt LASVM [5], an incremental online SVM, to
train  a  classifier  to  separate  object  from  background.  Here 
we  only  discuss  how  to  get  the  classifier  score  on  each 
block.  For  more  details  about  the  training  scheme,  see  [5]. 

3.1.  Conventional  learning 

The decision function of SVM [24] is

$$f(x) = \beta + \sum_{k=1}^{n_{sv}} \alpha_k K(x, x_k) \qquad (2)$$


where $x$ is a sample and $\{x_k : k \in \{1, 2, \ldots, n_{sv}\}\}$
are the support vectors. $K(x, x_k)$ is the kernel function, and
$\beta$ is the bias constant. Here, a linear kernel is used, which
means $K(x, x_k)$ is the inner product of the two feature vectors. To
use the observation in Section 2, instead of computing the
classification score for the whole sample patch, we compute that of
each block to infer whether partial occlusion occurs, and where it is.
It is important to note that we split the sample into the same blocks
that are used when computing HOG features.

Following  the  algorithm  of  Wang  et  al.  [26],  we  review 
the  derivation  to  obtain  the  distribution  of  the  bias  constant 
over  each  block,  then  formulate  it  in  online  manner  to  fit 
into  our  training  framework.  Due  to  the  linear  characteris¬ 
tics,  when  using  linear  kernel,  Eq.  2  is  rewritten  as: 

$$f(x) = \beta + x^T \sum_{k=1}^{n_{sv}} \alpha_k x_k = \beta + w^T x \qquad (3)$$

where $w = \sum_{k=1}^{n_{sv}} \alpha_k x_k$ is the weighted sum of
support vectors. Now we have to distribute the bias constant $\beta$ to
each block $B_i$, so that the contribution score of each block in the
final classifier confidence score can be computed after subtracting
that local bias $\beta_i$ from the total feature inner product over
that block.

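For consistency, we use the same notation as in [26]. Let us denote
$\{x_p^+\}$ as the set of HOG features of positive samples and
$\{x_q^-\}$ as the set of HOG features of negative samples, where
$p = 1, \ldots, N^+$ ($N^+$ is the number of positive samples) and
$q = 1, \ldots, N^-$ ($N^-$ is the number of negative ones).
$B_{p,i}^+$ and $B_{q,i}^-$ denote the $i$th blocks of $x_p^+$ and
$x_q^-$, respectively. Here $n_{blk}$ is the number of blocks and $w_i$
is the part of $w$ corresponding to the $i$th block.

Let $A = -\frac{S^-}{S^+}$, where $S^+$ and $S^-$ are the summed
classification scores of the positive and negative samples,
respectively:

$$S^+ = \sum_{p=1}^{N^+} f(x_p^+) = N^+ \beta + \sum_{p=1}^{N^+} \sum_{i=1}^{n_{blk}} w_i^T B_{p,i}^+ \qquad (4)$$

$$S^- = \sum_{q=1}^{N^-} f(x_q^-) = N^- \beta + \sum_{q=1}^{N^-} \sum_{i=1}^{n_{blk}} w_i^T B_{q,i}^- \qquad (5)$$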

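From Eq. 4 and Eq. 5, with the factor $A$, we have $A S^+ + S^- = 0$,
i.e.:

$$A N^+ \beta + N^- \beta + \sum_{i=1}^{n_{blk}} w_i^T \left( A \sum_{p=1}^{N^+} B_{p,i}^+ + \sum_{q=1}^{N^-} B_{q,i}^- \right) = 0 \qquad (6)$$

which can be written as:

$$\beta = B \sum_{i=1}^{n_{blk}} w_i^T \left( A \sum_{p=1}^{N^+} B_{p,i}^+ + \sum_{q=1}^{N^-} B_{q,i}^- \right) \qquad (7)$$

where $B = -\frac{1}{A N^+ + N^-}$. Now, we can have the distribution
of the bias constant on each block:

$$\beta_i = B\, w_i^T \left( A \sum_{p=1}^{N^+} B_{p,i}^+ + \sum_{q=1}^{N^-} B_{q,i}^- \right) \qquad (8)$$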

3.2.  Online  learning 

Up to this point, $\beta_i$ is only calculated in an off-line manner,
when all the training samples, i.e. positive and negative ones, are
known. Now we consider all of the notation above to refer to the
currently trained model. Assume we have $N_{new}^+$ new positive
samples and $N_{new}^-$ new negative ones.

Then $N'^+ = N^+ + N_{new}^+$ and $N'^- = N^- + N_{new}^-$ are the new
total numbers of positive and negative samples, respectively. Because
of the independence between blocks, $S'^+$ and $S'^-$ are computed as
follows:

$$S'^+ = \sum_{p=1}^{N'^+} f(x_p'^+) = N'^+ \beta' + \sum_{i=1}^{n_{blk}} w_i'^T B_i'^+ \qquad (9)$$

$$S'^- = \sum_{q=1}^{N'^-} f(x_q'^-) = N'^- \beta' + \sum_{i=1}^{n_{blk}} w_i'^T B_i'^- \qquad (10)$$

where $\beta'$ and $w_i'$ are the output of LASVM after online training
on the new samples; $B_i'^+ = B_i^+ + \sum_{p=1}^{N_{new}^+} B_{p,i}^+$
and $B_i'^- = B_i^- + \sum_{q=1}^{N_{new}^-} B_{q,i}^-$ are the updated
sums of the $i$th blocks over the positive and negative samples, i.e.
$x'^+$ and $x'^-$, respectively. Following the same computation, we
have $A' = -\frac{S'^-}{S'^+}$ and $B' = -\frac{1}{A' N'^+ + N'^-}$.

The new bias constant $\beta_i'$ for each block is updated using
Eq. 8 with all updated parameters.

The  occlusion  likelihood  map  is  generated  as  a  binary 
image  based  on  the  score  of  each  block  ,  which  is  0  if  the 
score  is  negative  and  1,  otherwise.  Each  pixel  in  the  likeli¬ 
hood  image  corresponds  to  a  block  in  the  sample. 
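
The online update and the binary map can be sketched together in Python as follows. This is a minimal illustration, assuming that only the sample counts and the per-block feature sums need to be stored between updates; the class name, the method names, and the row-major layout of the blocks are hypothetical, and the classifier output (the bias and the per-block weights) is passed in rather than trained here.

```python
class RunningBlockSums:
    """Keeps N+, N- and the per-block feature sums so Eq. 8 can be re-evaluated online."""

    def __init__(self, n_blk, dim):
        self.n_pos = 0
        self.n_neg = 0
        self.pos_sum = [[0.0] * dim for _ in range(n_blk)]
        self.neg_sum = [[0.0] * dim for _ in range(n_blk)]

    def add(self, sample_blocks, positive):
        """Accumulate one new sample, given as a list of per-block feature vectors."""
        target = self.pos_sum if positive else self.neg_sum
        for acc, block in zip(target, sample_blocks):
            for d, value in enumerate(block):
                acc[d] += value
        if positive:
            self.n_pos += 1
        else:
            self.n_neg += 1

    def betas(self, beta, w_blocks):
        """Per-block biases from the current sums and the current classifier output."""
        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        s_pos = self.n_pos * beta + sum(dot(w, s) for w, s in zip(w_blocks, self.pos_sum))
        s_neg = self.n_neg * beta + sum(dot(w, s) for w, s in zip(w_blocks, self.neg_sum))
        a = -s_neg / s_pos
        b = -1.0 / (a * self.n_pos + self.n_neg)
        return [
            b * dot(w, [a * p + q for p, q in zip(ps, ns)])
            for w, ps, ns in zip(w_blocks, self.pos_sum, self.neg_sum)
        ]


def occlusion_map(x_blocks, w_blocks, betas, n_cols):
    """Binary map with one entry per block: 0 if the block score is negative, 1 otherwise."""
    flat = []
    for w, x, b_i in zip(w_blocks, x_blocks, betas):
        score = sum(a * c for a, c in zip(w, x)) + b_i
        flat.append(0 if score < 0 else 1)
    return [flat[r:r + n_cols] for r in range(0, len(flat), n_cols)]


sums = RunningBlockSums(n_blk=2, dim=2)
sums.add([[1.0, 0.2], [0.1, 1.0]], positive=True)
sums.add([[0.8, 0.0], [0.0, 0.9]], positive=True)
sums.add([[0.0, 0.1], [0.2, 0.0]], positive=False)
w_blocks = [[1.0, 0.0], [0.0, 1.0]]
betas = sums.betas(-0.5, w_blocks)
omap = occlusion_map([[1.0, 0.0], [0.0, 0.1]], w_blocks, betas, n_cols=2)
```

Storing only the sums keeps the cost of each update proportional to the number of new samples rather than to the size of the whole training set.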

3.3.  Updating  the  model 

As  discussed  in  Section  2,  if  the  classifier  is  updated  in¬ 
cluding  the  occluding  region,  it  may  drift  because  the  noise 
(occluding  area)  becomes  part  of  the  model.  To  avoid  this 
issue,  the  non-occluded  part  is  kept  while  the  occluded  area 
is  inferred  from  a  previous  frame  (as  shown  in  Figure  5(c)). 
In  a  long-term  partial  occlusion,  we  can  consider  this  step  as 
a  recursive  process  where  the  occluded  area  of  the  object  in 
the  current  frame  is  projected  from  that  of  the  object  in  the 
previous  frame,  which  may  also  be  drawn  from  its  previous 
one. 
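
A minimal Python sketch of this recovery step is given below, assuming non-overlapping square blocks and a block map in which 0 marks an occluded block; both are simplifications made for the example (the HOG blocks used by the classifier overlap), and the function name is hypothetical. Feeding the recovered patch back in as the previous patch of the next frame gives the recursive behavior described above.

```python
def recover_patch(current, previous, occlusion_map, block_size):
    """Copy the pixels of occluded blocks (map value 0) from the previous recovered patch."""
    recovered = [row[:] for row in current]
    for br, map_row in enumerate(occlusion_map):
        for bc, visible in enumerate(map_row):
            if visible:
                continue
            for r in range(br * block_size, (br + 1) * block_size):
                for c in range(bc * block_size, (bc + 1) * block_size):
                    recovered[r][c] = previous[r][c]
    return recovered


# Toy 4x4 patch with 2x2 blocks; the top-right block is flagged as occluded.
current = [[1, 1, 1, 1] for _ in range(4)]
previous = [[9, 9, 9, 9] for _ in range(4)]
recovered = recover_patch(current, previous, [[1, 0], [1, 1]], block_size=2)
```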

4.  Generative  tracker  using  linear  subspace 

Although  the  use  of  multiple  linear  subspaces  [29]  pro¬ 
duces  good  results  in  tracking,  the  ambiguity  is  high  when 


(a)  Some  partial  occlusion  cases  at  frame  28,  88,  and  185 


(b)  Generative  tracker  model  update 


(c)  Discriminative  tracker  model  update 

Figure  5.  Occlusion  recovery  from  our  trackers  (the  image  patch 
is  scaled  to  32x32  for  training) 


deciding  whether  to  create  a  new  subspace  and  merge  a  pair 
of  existing  ones  or  not.  It  is  even  more  ambiguous  when 
partial  occlusion  appears.  Some  new  subspaces  may  be  cre¬ 
ated,  which  do  not  reflect  the  correct  appearance  model  of 
the  object.  Also,  more  noise  is  included  in  the  model  by 
encoding  the  appearance  that  way. 

Here  we  propose  to  use  a  single  linear  subspace  to  ap¬ 
proximate  the  appearance  model  of  the  object.  This  is 
similar  to  the  incremental  visual  tracker  (IVT)  by  Ross  et 
al.  [21],  but  with  partial  occlusion  handling. 


4.1.  Online  learning 


In the initialization step, after collecting several samples by simple
template matching, we train the model of the object from those $n$
training images $I_{ini} = \{I_1, \ldots, I_n\}$ by computing the
eigenvectors $U$ of the covariance matrix
$\frac{1}{n-1} \sum_{i=1}^{n} (I_i - \bar{I})(I_i - \bar{I})^T$, where
$\bar{I} = \frac{1}{n} \sum_{i=1}^{n} I_i$ is the mean of the training
images. It can be solved by singular value decomposition (SVD)
$P = U \Sigma V^T$ of the centered data matrix
$[(I_1 - \bar{I}) \ldots (I_n - \bar{I})]$.

Given $m$ new images $I_{add} = \{I_{n+1}, \ldots, I_{n+m}\}$, the
subspace needs to be incrementally updated by calculating the SVD of
$[P\ Q]$, where $Q$ is the new observation matrix according to
$I_{add}$. As the result of the derivation in [21], we have

$$[P\ Q] = \left([U\ \tilde{Q}]\, \tilde{U}\right) \tilde{\Sigma} \left(\tilde{V}^T \begin{bmatrix} V^T & 0 \\ 0 & I \end{bmatrix}\right) \qquad (11)$$

in which $\tilde{Q}$ is the component of $Q$ orthogonal to $U$.
Finally, we have $U' = [U\ \tilde{Q}]\, \tilde{U}$ and
$\Sigma' = \tilde{\Sigma}$ as the updated eigensystem. In our
implementation, for efficiency, the top $k$ eigenvectors ($k = 10$) are
maintained to represent the model of the learned object.
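
For reference, the factorization behind Eq. 11 can be written out as follows. This is a sketch of the sequential Karhunen-Loeve step as we understand it from [21], ignoring the mean update and the forgetting factor that the full algorithm there also handles; $R$ is an intermediate matrix introduced here for the derivation, and $\operatorname{orth}(\cdot)$ denotes an orthonormal basis of the column space.

```latex
\tilde{Q} = \operatorname{orth}\!\left(Q - U U^{T} Q\right), \qquad
R = \begin{bmatrix} \Sigma & U^{T} Q \\ 0 & \tilde{Q}^{T}\left(Q - U U^{T} Q\right) \end{bmatrix}
  = \tilde{U}\,\tilde{\Sigma}\,\tilde{V}^{T} \quad \text{(SVD of } R\text{)}

[P \;\; Q] = [U \;\; \tilde{Q}]\; R \begin{bmatrix} V^{T} & 0 \\ 0 & I \end{bmatrix}
           = \left([U \;\; \tilde{Q}]\,\tilde{U}\right) \tilde{\Sigma}
             \left(\tilde{V}^{T} \begin{bmatrix} V^{T} & 0 \\ 0 & I \end{bmatrix}\right)
```

The benefit is that the SVD is taken of the small matrix $R$, whose size depends on the number of retained eigenvectors and new images, rather than of the full data matrix.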

4.2.  Evaluation 


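Given a subspace $\Omega$ with the first $k$ eigenvectors, the
projection of a sample $x$ on $\Omega$ is
$y = (y_1, \ldots, y_k)^T = U^T (x - \bar{x})$. Then the likelihood of
$x$ can be expressed as:

$$p(x \mid \Omega) = \left[ \frac{\exp\left(-\frac{1}{2} \sum_{i=1}^{k} \frac{y_i^2}{\lambda_i}\right)}{(2\pi)^{k/2} \prod_{i=1}^{k} \lambda_i^{1/2}} \right] \cdot \left[ \frac{\exp\left(-\frac{\epsilon^2(x)}{2\rho}\right)}{(2\pi\rho)^{(d-k)/2}} \right] \qquad (12)$$

where $\lambda_i$ is the eigenvalue with respect to $y_i$, $d$ is the
dimension of the input, and $\epsilon(x) = |x - U U^T x|$ is the
projection error. The parameter
$\rho = \frac{1}{d-k} \sum_{i=k+1}^{d} \lambda_i$ is approximated as
$\rho = \frac{1}{2} \lambda_{k+1}$.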
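
As a numerical illustration of the likelihood in Eq. 12, the following Python sketch evaluates it in the log domain. Working with logarithms is a choice made for this example to avoid underflow when $d$ is large (for instance 1024 for a 32x32 patch); the function name and argument layout are hypothetical. The sketch assumes the subspace coefficients $y$ and the projection error have already been computed.

```python
import math


def subspace_log_likelihood(y, eigvals, residual_norm, d, rho=None):
    """Log of the two-factor likelihood: in-subspace term plus off-subspace term.

    y             -- the k subspace coefficients of the sample
    eigvals       -- eigenvalues in decreasing order (k+1 values if rho is None)
    residual_norm -- projection error epsilon(x)
    d             -- dimension of the input
    """
    k = len(y)
    if rho is None:
        # Rough approximation rho = lambda_{k+1} / 2.
        rho = 0.5 * eigvals[k]
    in_subspace = -0.5 * sum(yi * yi / lam for yi, lam in zip(y, eigvals[:k]))
    in_subspace -= 0.5 * k * math.log(2.0 * math.pi)
    in_subspace -= 0.5 * sum(math.log(lam) for lam in eigvals[:k])
    off_subspace = -(residual_norm ** 2) / (2.0 * rho)
    off_subspace -= 0.5 * (d - k) * math.log(2.0 * math.pi * rho)
    return in_subspace + off_subspace


# Toy example with k = 2 and d = 3.
ll_clean = subspace_log_likelihood([0.0, 0.0], [1.0, 1.0, 2.0], 0.0, d=3)
ll_noisy = subspace_log_likelihood([0.0, 0.0], [1.0, 1.0, 2.0], 1.5, d=3)
```

A larger projection error lowers the likelihood, which is the behavior exploited for occlusion detection.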

However, to detect partial occlusion, as discussed in Section 2 and
shown in Fig. 2(c), the projection error is split into blocks, in the
same way as in the discriminative tracker. We simply compute the
occlusion likelihood by using the projection error over each block.

These likelihood values are then normalized to generate the occlusion
likelihood map, which is a binary image. The 0 value corresponds to a
block whose score is lower than 50% of the maximum block score, and 1
otherwise.


4.3.  Updating  the  model 

To avoid modeling the occluding part when partial occlusion occurs,
instead of updating the whole image patch as described in Section 4.1,
we propose an algorithm to recover the occluded part. Using the
generative model, we project the information encoded in the learned
subspace onto the occluded area to fill up the image patch (Figure 5)
and follow the online learning in Section 4.1.
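
The recovery step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the patch is flattened into a vector, the basis vectors are orthonormal, and the subspace coefficients are computed from the whole patch, including the occluded pixels. Fitting the coefficients to the visible pixels only would be a possible refinement that is not shown. The function names are hypothetical.

```python
def reconstruct(x, mean, basis):
    """Project x onto the span of the orthonormal basis vectors and map it back."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    coeffs = [sum(u_i * c_i for u_i, c_i in zip(u, centered)) for u in basis]
    recon = list(mean)
    for coeff, u in zip(coeffs, basis):
        for i, u_i in enumerate(u):
            recon[i] += coeff * u_i
    return recon


def fill_occluded(x, mean, basis, visible):
    """Keep visible pixels and replace occluded ones with the subspace reconstruction."""
    recon = reconstruct(x, mean, basis)
    return [xi if v else ri for xi, ri, v in zip(x, recon, visible)]


# Toy example: a 4-pixel patch, a one-dimensional subspace, last pixel occluded.
mean = [0.0, 0.0, 0.0, 0.0]
basis = [[0.5, 0.5, 0.5, 0.5]]
patch = [2.0, 2.0, 2.0, 10.0]
filled = fill_occluded(patch, mean, basis, visible=[1, 1, 1, 0])
```

The filled patch is then used in place of the raw patch for the incremental update of Section 4.1.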


5.  Local  features  movement  voting  using  KLT 

Taking advantage of the simplicity and fast computation of KLT features
[22], tracking consistency is checked based on the movement of these
features in the object region at every frame. Due to the discontinuity
between non-occluded and occluding regions, some KLT features are
driven in the same direction and with the same velocity, which differ
from those of the remaining part. Taking this observation into account,
we propose a voting scheme on the movement of these local features to
detect the occlusion.

After being detected in the first frame, these features are tracked in
every frame. After removing all of the outliers, the displacement
magnitude of each feature is normalized to [0,1] and encoded in a
4-bin histogram. The direction of the movement is encoded in an 8-bin
histogram, each


(a)  Frame  'ill  (b)  Frame  373  (c)  Generative  (d)  Discriminative 

Figure 6. Disagreement in occlusion detection from the two trackers.


of which covers a $\pi/4$ span. All displacement vectors thus
accumulate into a 4x8 2D histogram.
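
The 4x8 movement histogram can be built as in the following Python sketch. It assumes that the magnitudes are normalized by the largest displacement in the current frame, which is one possible way to map them to [0,1], and that outliers have already been removed; the function name is hypothetical.

```python
import math


def motion_histogram(displacements, n_mag=4, n_dir=8):
    """Accumulate (dx, dy) feature displacements into an n_mag x n_dir histogram."""
    hist = [[0] * n_dir for _ in range(n_mag)]
    mags = [math.hypot(dx, dy) for dx, dy in displacements]
    max_mag = max(mags) if mags else 0.0
    bin_width = 2.0 * math.pi / n_dir  # pi/4 for 8 direction bins
    for (dx, dy), mag in zip(displacements, mags):
        norm = mag / max_mag if max_mag > 0 else 0.0
        i = min(int(norm * n_mag), n_mag - 1)
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        j = min(int(angle / bin_width), n_dir - 1)
        hist[i][j] += 1
    return hist


# Three toy displacement vectors pointing roughly right, up, and left.
hist = motion_histogram([(2.0, 0.2), (-0.1, 1.0), (-3.0, -0.3)])
```

Features that are pushed together by an occluder tend to fall into the same cell, which is what the vote below looks for.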

Let $R$ be the candidate occluded region voted by the discriminative
and generative trackers, $H = \{h_{i,j}\}$ the histogram of all the
KLT features in the current frame, and $H' = \{h'_{i,j}\}$ the
histogram of the KLT features which were originally in the current
occluded region. Let $\bar{H} = H - H'$ be the histogram of the
non-occluded part. We have:

$$(i_{max}, j_{max}) = \arg\max_{i,j} (h'_{i,j}) \qquad (13)$$

where $h'_{i,j} \in H'$. Let us call $h'_{max} = h'_{i_{max}, j_{max}}$;
the condition for $R$ to be considered as the occluded part is


1  if 
0  otherwise 


(14) 


This equation can be understood as a check of the uncertainty of the
local features in the occluded region. When the majority of the KLT
features in a region have a different movement behavior than the rest,
partial occlusion is detected. In practice, we choose $\theta = 0.7$.
It is important to note that the KLT features are re-initialized after
occlusion and that this step is only applied as a referee when there is
disagreement on occlusion detection between the generative and
discriminative trackers. An example of occlusion detection disagreement
between these two trackers is shown in Fig. 6. This disagreement is
resolved with the use of KLT feature occlusion detection serving as a
referee.


6.  Experiments 

6.1.  Implementation  Details 

To  implement  the  generative  and  discriminative  models, 
depending  on  the  size  ratio  of  the  object,  we  use  an  im¬ 
age  vector  of  size  32x32  for  square-shape  and  32x64  for 
rectangle-shape.  For  the  generative  tracker,  the  subspace  is 
maintained  by  the  top  k=10  eigenvectors  because  it  gives 
us  the  best  trade-off  in  precision  and  running  time.  Every  5 
frames,  we  update  the  subspace  once.  For  the  discriminative 
tracker,  we  use  the  linear  kernel  LASVM  [5]  with  R-HOG 
feature  set  [6]  (16x16  block  size  and  8x8  cell  size).  To  al¬ 
low  the  overlapping  HOG  descriptor,  we  use  the  step  size 
of  8.  For  a  block  we  have  36-bin  oriented  histogram.  Be¬ 
cause  of  the  growth  in  number  of  support  vectors,  a  sliding 
window  of  300  frames  is  applied  to  focus  on  the  current  ap¬ 
pearance  of  the  object.  In  the  first  frame,  we  manually  select 


the  object  and  apply  simple  template  matching  for  the  next 
4  frames.  These  initial  labeled  data  are  then  transferred  to 
both  generative  and  discriminative  trackers  for  training.  Our 
Bayesian  framework  generates  600  particles  at  each  frame. 
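
The block layout implied by these settings can be checked with a short calculation. The Python sketch below assumes 9 orientation bins per cell, which is inferred from the 36-bin block histogram (four 8x8 cells per 16x16 block) rather than stated explicitly above; the function name is hypothetical.

```python
def hog_layout(width, height, block=16, cell=8, stride=8, bins_per_cell=9):
    """Return (blocks_x, blocks_y, block_dim, total_dim) for a dense block grid."""
    cells_per_block = (block // cell) ** 2
    blocks_x = (width - block) // stride + 1
    blocks_y = (height - block) // stride + 1
    block_dim = cells_per_block * bins_per_cell
    return blocks_x, blocks_y, block_dim, blocks_x * blocks_y * block_dim


square = hog_layout(32, 32)     # 3 x 3 blocks of 36 bins each
rectangle = hog_layout(32, 64)  # 3 x 7 blocks of 36 bins each
```

Under these assumptions a 32x32 patch yields 9 blocks (324 features) and a 32x64 patch yields 21 blocks (756 features), so each occlusion likelihood map has 9 or 21 entries.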

The  combined  tracker  is  implemented  in  C++  and  runs  at 
4fps  on  an  Intel  QuadCore  3.0GHz  system.  At  every  frame, 
each  of  the  trackers  independently  predicts  the  unlabeled 
data  based  on  its  trained  model.  Following  [29],  in  the  dis¬ 
criminative  tracker,  we  also  convert  score  of  SVM  to  proba¬ 
bility  output  [20]  and  follow  the  same  threshold  settings  for 
both  trackers. 

In  the  co-decision  step  to  combine  the  two  occlusion 
likelihood  maps,  we  simply  use  an  AND  operator  to  inte¬ 
grate  them  into  the  final  one.  However,  when  there  are  more 
than  70%  of  the  pixels  different  between  these  two  likeli¬ 
hood  maps,  we  use  local  features  movement  voting  process 
to  choose  the  detector  to  rely  on.  In  our  experiments,  it 
happens  mostly  when  there  is  local  change  causing  false 
occlusion  detection  by  generative  model. 
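
The co-decision rule can be sketched in Python as follows. The referee is passed in as a callback that returns the map of the detector to rely on; its signature is a hypothetical interface for this example, standing in for the local features movement voting process of Section 5.

```python
def disagreement(map_a, map_b):
    """Fraction of entries that differ between two binary maps of equal size."""
    total = sum(len(row) for row in map_a)
    differing = sum(a != b for ra, rb in zip(map_a, map_b) for a, b in zip(ra, rb))
    return differing / total


def co_decision(gen_map, disc_map, referee, max_disagreement=0.7):
    """AND the two maps, unless they disagree too much; then defer to the referee."""
    if disagreement(gen_map, disc_map) > max_disagreement:
        return referee(gen_map, disc_map)
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(gen_map, disc_map)]


def trust_discriminative(gen_map, disc_map):
    """Toy referee that always sides with the discriminative map."""
    return disc_map


agreed = co_decision([[1, 1], [0, 1]], [[1, 0], [0, 1]], trust_discriminative)
refereed = co_decision([[1, 1], [1, 1]], [[0, 0], [0, 1]], trust_discriminative)
```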

6.2.  Comparison 

We  tested  our  algorithm  on  several  challenging  pub¬ 
lished  video  sequences  of  different  types  of  objects  in  in¬ 
door  and  outdoor  environments.  Several  related  state-of- 
the-art  trackers  included  in  the  comparison  are  the  Co- 
Tracker  [29],  which  is  the  most  related  to  our  tracker,  the 
Frag-Tracker  [1],  the  Online  and  Semi-Boosting  Tracker 
(OAB,  SB)  [8,  9],  the  P-N  Tracker  (PNT)  [11],  the  MIL- 
Tracker  [3]  and  its  new  variation  with  no  regret  MIO 
Tracker [14]. We use the provided results and published source code
from the authors (see footnotes). These methods
were  also  demonstrated  on  published  benchmark  video  se¬ 
quences  for  comparison.  We  also  demonstrated  the  robust¬ 
ness  of  our  proposed  partial  occlusion  handling  co-training 
framework  by  comparing  our  tracker  with  each  component 
without  occlusion  handling.  For  the  MIL,  MIO,  OAB,  SB 
trackers,  we  use  the  settings  described  in  the  papers  with 
some  optimized  parameters  in  search  range  and  number  of 
selectors.  The  parameters  (except  the  search  range)  in  Frag- 
Tracker  are  kept  as  default.  In  the  Co-Tracker  and  the  two 
components  of  our  tracker,  the  parameters  are  set  the  same 
as  our  tracker.  To  prove  the  precision  of  our  tracker,  we 
used  the  same  measurement,  average  center  location  errors 
(in  pixels),  used  for  evaluation  in  [3,  14]. 
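
For clarity, the measurement can be computed as in the following Python sketch, which averages the Euclidean distance between the tracked center and the ground truth center over the labeled frames only. Representing both tracks as dictionaries keyed by frame index is an assumption made for the example, and the function name is hypothetical.

```python
import math


def average_center_error(predicted, ground_truth):
    """Mean Euclidean distance (in pixels) over the frames that have ground truth.

    predicted    -- dict: frame index -> (x, y) tracked center
    ground_truth -- dict: frame index -> (x, y) labeled center
    """
    errors = []
    for frame, (gx, gy) in ground_truth.items():
        px, py = predicted[frame]
        errors.append(math.hypot(px - gx, py - gy))
    return sum(errors) / len(errors)


# Toy example with ground truth every five frames.
predicted = {0: (10.0, 10.0), 1: (11.0, 10.0), 5: (13.0, 14.0)}
ground_truth = {0: (10.0, 10.0), 5: (10.0, 10.0)}
error = average_center_error(predicted, ground_truth)
```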

The  testing  sequences  include  five  videos  reported  by 
MILTracker  and  one  video  published  in  [30].  The  ground 
truth  centers  of  every  five  frames  are  also  provided  by 
Babenko et al. (see the MILTracker footnote). We also labeled the ground truth for the
new  sequence  in  the  same  manner.  The  resolution  of  all 
video  frames  is  320x240,  except  the  Occluded  Face  which 

^MILTracker:  http://vision.ucsd.edu/  babenko/project_miltrack.shtml 

^Frag-Tracker:  http://www.cs.technion.ac.il/amita/fragtrack/fragtrack.htm 

^  Semi-boosting  Tracker:  http://www.vision.ee.ethz.ch/boostingTrackers/index.htm 


Video Sequence    Frames   GT   DT   FT  OAB   ST  PNT  MILT  MIO  CoT  Ours
Coke Can             292  102    9   67   25   85    8    21   22   10     8
Occluded Face 1      900   86   17    7   44   41    8    27   14   16     5
Occluded Face 2      808   14   12   21   21   43    8    20   13   12     7
Person               200   35   73   44   37  154   44    34  n/a   33     5
Tiger 1              354   52    6   40   35   46   13    15   24    5     4
Tiger 2              365   43    7   37   34   53   21    17   23    7     5

Table 1. Average center location errors in different challenging datasets. (GT: Generative Tracker, DT: Discriminative Tracker, FT: Frag-Tracker [1], OAB: Online Boosting
Tracker [8], ST: Semi-Boosting Tracker [9], PNT: P-N Tracker [11], MILT: MILTracker [3], MIO: MIL No Regret Tracker [14], CoT: Co-
Tracker [29].) The best performance is in bold, the second best is in italic.


(a) Occluded Face 1 (b) Occluded Face 2 (c) Person (d) Tiger 2
Legend: MILTracker, Frag-Tracker, Our Tracker
Figure 7. Some screenshots from the testing results. For clarity, we only choose Frag-Tracker [1] and MILTracker [3] to
show some results in comparison with our tracker.


is 352x288. The quantitative comparison results, which are presented
in Table 1, clearly show the advantages of our approach. “N/a” is
reported when we do not have the results from that method. The other
trackers cannot adapt well to object appearance changes such as
lighting change and pose change (Occluded Face 2), and they fail when
total occlusion appears in the “Person” sequence. Also, our algorithm
helps to avoid drift, as shown by the lowest position error in all
sequences (tied with the P-N Tracker on Coke Can). All of the
sequences contain long-term and heavy partial or total occlusions, and
challenging appearance changes such as illumination changes, abrupt
motion, rotation, and cluttered backgrounds.

Occluded Face and Occluded Face 2: Although the “Occluded Face”
sequence contains many occlusion cases, the object does not move much
and is quite distinctive from the background. However, it is a good
example for us to test our occlusion detection. Our tracker outperforms
the others, especially Frag-Tracker, which was proposed to solve the
partial occlusion problem using a part-based model. The “Occluded
Face 2” sequence is much more complicated: it contains illumination
change, in-plane rotation, and heavy occlusion. Other trackers can
hardly locate the center precisely when occlusion and rotation happen,
while our tracker tracks the face consistently with an accurate
rotation estimate.

Coke  Can,  Tiger,  and  Tiger2:  These  three  sequences  share 
the  same  challenges:  illumination  changes,  abrupt  motion, 
partial  occlusion,  and  rotation  changes.  The  generative 
model  is  not  very  effective  because  of  abrupt  changes  in 
appearance,  whereas  the  discriminative  one  obtains  excel¬ 
lent  performance. 

Person: This sequence was taken outdoors and contains total occlusion
[30]. While Frag-Tracker gets stuck at a similar area in the background
when the person rotates, and MILTracker cannot handle the occlusion
when another person passes in front of our object, our tracker can
re-initialize to the target immediately after the total occlusion based
on the occlusion recovery information.

Please  refer  to  our  supplemental  video  for  the  details. 

7.  Conclusions  and  future  work 

We  have  proposed  a  novel  co-training  framework  of  gen¬ 
erative  and  discriminative  trackers  with  partial  occlusion 
handling.  Our  algorithm  can  encode  the  global  appearance 


model  of  the  object  in  a  compact  linear  subspace  while 
strengthening  the  discriminative  power  to  separate  the  ob¬ 
ject  and  background.  The  co-decision  process  for  occlusion 
handling,  with  the  help  of  the  local  features  movement  vot¬ 
ing  process,  robustly  detects  the  occluded  region  and  helps 
the  trackers  ignore  that  region  and  update  the  new  model 
consistently.  Moreover,  the  co-training  framework  helps  the 
two  trackers  update  each  other  on-the-fly,  which  is  espe¬ 
cially  helpful  when  each  of  them  fails  during  tracking. 

Currently,  our  tracker  cannot  handle  the  case  when  there 
is  an  abrupt  change  during  the  occlusion  because  there  is  no 
learned  knowledge  to  predict  the  changes  in  the  hidden  re¬ 
gion  according  to  the  revealed  one.  In  the  future,  we  expect 
to  build  a  learning  algorithm  to  cope  with  this  issue.  We 
also  expect  to  develop  specific  trackers  for  different  types 
of  objects  such  as  face,  people,  and  vehicle  whose  model 
can  be  learned  offline. 